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Abstract This paper provides an overview and a protocol of molecular clock dating using 
MrBayes. Two modern approaches, total-evidence dating and node dating, are demonstrated using 
a truncated dataset of Hymenoptera with molecular sequences and morphological characters. The 
similarity and difference of the two methods are compared and discussed. Besides, a non-clock 
analysis is performed on the same dataset to compare with the molecular clock dating analyses. 
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1 Introduction 


MrBayes is a software for Bayesian phylogenetic inference (Huelsenbeck and Ronguist, 
2001; Ronquist and Huelsenbeck, 2003) and has-become widely used by biologists (Van 
Noorden et al., 2014). Many new features have beenamplemented since version 3.2 (Ronquist 
et al., 2012b), including species tree inference under the 4nulti-species coalescent model (Liu, 
2008; Liu et al., 2009); compound Dirichlet priors for branch lengths'(Rannala et al., 2012; 
Zhang et al., 2012); marginal model likelihood estimation using stepping-stone sampling (Xie 
et al., 2011); topology convergence diagnostics using the average standard đeviation of split 
frequencies (Lakner et al., 2008); BEAGLE library support (Ayres et al., 2012) and parallel 
computing using MPI (Altekar et al., 2004). 

In addition to these features, the focus of this study is its new functionality of divergence 
time estimation using node dating (Yang and Rannala, 2006; Ho and Phillips, 2009) and 
total-evidence dating (Ronquist et al., 2012a; Zhang et al., 2016) approaches under relaxed 
molecular clock models (Huelsenbeck et al., 2000; Thorne and Kishino, 2002; Drummond 
et al., 2006; Lepage et al., 2007). Both dating techniques are implemented in the Bayesian 
framework, such that the diversification process and the fossil information are incorporated 
in the priors. However, they do have significant differences. In node dating, only molecular 
sequences from extant taxa are used, and one or more internal nodes of the extant taxon tree 
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are calibrated by user-specified prior distributions, typically derived from the fossil record, 
to estimate the ages of all other nodes in the tree. In total-evidence dating, extant and extinct 
taxa are all included in the tree, and morphological characters are coded for fossil and extant 
taxa in a combined matrix with molecular sequences. The age of each fossil is assigned a prior 
distribution directly. 

Several steps are involved in a Bayesian molecular clock dating analysis, importantly 
including partitioning the data, specifying evolutionary models, calibrating internal nodes 
or fossils, and setting priors for the tree and the other parameters. In this paper, I introduce a 
general clock dating protocol that could potentially be applied to a wide range of taxonomic 
groups, using a truncated Hymenoptera data as an example. 

The original data analyzed by Ronquist et al. (2012a) and Zhang et al. (2016) includes 
60 extant and 45 fossil Hymenoptera (wasps, ants, and bees) and eight outgroup taxa. The 
data Were divided into eight partitions as follows: 1) morphology, 2) 12S and 16S, 3) 18S, 4) 
28S, 5) I "and 2 4eodon positions of CO1, 6) 3" codon positions of COL (but not used in the 
analyses), 7) 1 amd 2#eodon positions of both copies of EFla, and 8) 3" codon positions of 
both copies of EFla. These are based on morphological characters, mitochondrial and nuclear 
genes. The full data takes days tố run. For illustrative purposes, the data is truncated into ten 
extant taxa (nine Hymenoptera and one Raphidioptera) and ten fossils, with 200 morphological 
characters, 100 sites from 16S, and 210 sites from EFla. Below, the dataset is analyzed by 
both total-evidence dating and node dating approach@s,under diversified sampling of extant 
taxa (Hôhna et al., 2011; Zhang et al., 2016), followed bysa.non-clock analysis under the 
compound Drichlet prior for branch lengths (Rannala et al., 2012; Zhang et al., 2012). 


2 Protocol 


2.1 MrBayes installation 


The program MrBayes is available from GitHub (https://github.com/NBISwedeń/ 
MrBayes), including pre-compiled executables for macOS and Windows, and source code for 
all platforms. The current version is 3.2.7, which is used here. The truncated dataset hym.nex 
is in the doc/tutorial folder within the release. For convenience, it is recommended 
to put it in the same directory as the executable (e.g., named mb in macOS/Linux or mb. exe 
in Windows). The file hym.nex in the NEXUS format can be opened by a text editor or 
a matrix editor such as Mesquite (https: //www.mesquiteproject.org). The data matrix is at 
the beginning, including morphological characters and molecular sequences. Following the 
data block, users can write MrBayes commands within the mrbayes block (each ends with a 
semicolon). These commands will be executed automatically when the data is read in. The 
texts within a pair of square brackets are comments and will be ignored by the program. 

In terminal (macOS/Linux) or command prompt (Windows), navigate to the folder 
containing the executable and data file using the cd command, and launch MrBayes using . / 
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mb (macOS/Linux) or mb . exe (Windows). The following header will appear. 
MrBayes 3.2.7 x86 64 

(Bayesian Analysis of Phylogeny) 

Distributed under the GNU General Public License 

MrBayes > 


The prompt at the bottom means that MrBayes is running and ready for your commands. 
Simply use 
execute hym.nex 


to read in the data. 


22 Data partitions 


When the data is read in, the commands for defining the partitions are also executed. 
chabset MV = 1-200 
chege ES = 201-300 
charset W M = 301-510 
charset Eflal2 =aœ391-510\3 302-510\3 
charset Efla3 ="303-51013 
partition four = 4: MW, @&65,, Eflal2, Efla3 


set partition = four 


Here we define four partitions: the mórphológy (MV), 16S, 1* and 2" codon positions of 
EFla (Efla12), and 3” codon positions of EFla (Efta3). The name “four” is user defined and 
can be any other string. Note the semicolon (;) at the end óf each command is neglected as the 
command is supposed to be typed after the MrBayes prompt (a. semicolon is necessary when 
the command is written in a file, the same below). 


2.3 Substitution models 


For the morphological partition, the Mkv model (Lewis, 2001) is used with variable 
ascertainment bias (only variable characters scored), equal state frequencies and gamma tate 
variation across characters. 

Iset applyto = (1) coding = variable rates = gamma 
Constant characters will be excluded automatically by the program. To be specific, use the 
following command to exclude them. 

exclude 7 31 61 83 107 121 122 133 182 183 198 
If instantaneous change is only allowed between adjacent states (e.g., 0-51 and 1-2 but not 
0-2), these characters are specified using this command. 

ctype ordered: 20 23 27 30 36 41 42 44 46 48 59 65 75 78 79 89 
99 112 117 134 146 157 159 171 185 191 193 196 
The other characters are thus unordered which can change instantaneously from one state to 
another. Dollo or irreversible character is not supported in MrBayes at the moment. 

For the molecular partitions, the general time-reversible model is used with gamma rate 
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variation across sites (GTR+T) (Yang, 1994a,b). 

Iset applyto = (2,3,4) nst = 6 rates - gamma 
The prior for the gamma shape parameter 1s exponential(1.0), which can be changed using 
prset shapepr. We keep the default here. 

Different partitions are assumed to have independent substitution parameters, thus we 
unlink them. The partition-specific rate multipliers are used to account for rate variation across 
partitions with average to 1.0. 

unlink statefreq = (all) revmat = (all) shape = (all) 


prset applyto = (all) ratepr - variable 


2.4 Relaxed clock models 


The relaxed clock models account for evolutionary rate variation over time and among 
branchéSpand it is now a standard practice to accommodate such variation in dating analyses. 
There are three relaxed clock models implemented in MrBayes: compound Poisson process 
(CPP; Huelsenbeck et al., 2000), autocorrelated lognormal (TK02; Thorne and Kishino, 
2002) and independent gamma rate (IGR; Lepage et al., 2007). However, the CPP model is 
computationally not compatible with total-evidence dating, thus we only focus on the IGR and 
TK02 models. The outcomes of using these two models are discussed below. 

The mean clock rate (mean substitutioń rate per site per Myr) is assigned a lognormal(-7, 
0.6) prior, with mean 0.001 and standard deviation 0.0007. 

prset clockratepr = lognorm(-7,0.6) 
The prior is chosen by comparing the age of the oldest insect fossil.with the root age estimation 
from uncalibrated clock analysis as discussed in Ronquist et al.(2012a). One might need 
to specify a different suitable distribution. There are several options fer the clock rate prior, 
including fixed, normal (truncated at zero), Lognormal, and gamma. The probability 


density functions (all with mean 0900) and 
8 - — lognormal(-7, 0.6) standard deviation 0.0007) are shown in Fig, IX 
= — gamma(2, 2000) | i A i 
— normal(0.001, 0.0007) The differences in this case are minor. 
ae The relative clock rates, which are mul- 
KZ tiplied by the mean clock rate, vary differently 
= g | along the branches of the tree in different models. 
y The TK02 model assumes that the relative rate 
5 changes along the branches as Brownian motion 
sd on the log scale, starting from 1.0 (0.0 on the log 
scale) at the root. The rate at the end of a branch 
AAA is lognormally distributed with mean equal to the 
a AA o s 0904 000 rate at the beginning of the branch. Thus the rates 
Fig. 1 Probability density functions at different branches are autocorrelated. 
of lognormal, gamma, and normal distributions, prset clockvarpr = tk02 


all with mean 0.001 and standard deviation 0.0007 prset tk02varpr = exp(1) 
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The IGR model assumes that the relative rate at each branch is an independent gamma 
distribution with mean 1.0. The variance is proportional to the branch length. Thus the rates are 
independent of each other (but not identically distributed). 

prset clockvarpr = igr 

prset igrvarpr = exp (10) 
Note that when the relative rates are all fixed to 1.0 in the TK02 or IGR model, it becomes the 
strict clock model. 

Before the total-evidence dating and node dating analyses, we first define the fossil 

taxa and some constraints for later use. Note these constraints are not enforced until we set 
topologypr explicitly (see below). 


taxset fossils = Asioxyela Nigrimonticola Xyelotoma Undatoma 


Dahuratoma Cleistogaster Ghilarella Mesorussus Prosyntexis 


Psemdoxyelocerus 
constraint+HymenFossil = 2-. 
constrainteHŷmenoptera = 2-10 
constraint HoPometabola = 1-10 
constraint Tenthredińidae = 3-5 


constraint CepSirOruApo/ =) 7720 
The numbers here refer to the taxon numberssordered as in the data matrix, while a dot (.) 
means the last taxon. 


2.5 Total-evidence dating 


In the following, we assign priors for the fossils from the geological times. This is a 
typical step in total-evidence dating, where we calibrate the fossil taxa instead of the internal 
nodes of the tree. Fossil age uncertainties are represented as uniform distributions, while fixed 
values can be used when the geological age is very certain. 

calibrate 
Asioxyela = unif(228,242) 
Nigrimonticola = unif (152,163) 
Xyelotoma = unif (152,163) 
Undatoma = unif (145,152) 
Dahuratoma = fixed (134) 
Cleistogaster = unif (168,191) 
Ghilarella = unif(113,125) 


Mesorussus = unif (94,100) 


Prosyntexis = unif (80,86) 


Pseudoxyelocerus = fixed (182) 


prset nodeagepr = calibrated 


To enable the calibrations, the last nodeagepr command must be present. 
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The speciation, extinction, fossilization, and sampling process is explicitly modeled using 
the fossilized birth-death (FBD) process (Stadler, 2010; Heath et al., 2014; Gavryushkina et al., 
2014; Zhang et al., 2016). 

prset brlenspr = clock:fossilization 


Diversified sampling (Zhang et al., 2016) is arguably suitable for such higher-level taxa, which 
assumes exactly one representative extant taxa per clade descending from time x,,, is sampled, 
and the fossils are sampled with a non-zero rate before x,„„ and zero after. A fossil can be either 
a tip or a sampled ancestor (Fig. 2). 

prset samplestrat = diversity 
By default, the sampling strategy for extant taxa is uniformly at random (samplestrat = 
random) which is not used here, but see Zhang et al. (2016) for the application. 


A complete tree B diversified sampling 


t 


mrca 


Fig.2 The fossilized birth-death (FBD) process and diversified sampling of extant taxa 
ew 18 sampled (blue dots). The fossils are 
and x,,, (red dots) 


Exactly one representative taxa per clade descending from time x 
sampled with a constant rate between £, 


mrca 


The model has four parameters: speciation rate 2, extinction rate u, fossil discover rate y, 
and extant sample proportion p. For inference, we re-parametrize the parameters as'd 34 & u, 
r=ul/ 4, and s = wy / ( + y), so that the latter two parameters range from 0 to 1. p is fixed to 
0.0001, based on the living number of Hymenoptera at about 100000 (10/100000 = 0.0001). 
prset sampleprob - 0.0001 


prset speciationpr = exp(10) 
prset extinctionpr = beta(1,1) 
prset fossilizationpr = beta (1,1) 
prset treeagepr = offsetexp (300, 390) 
The FBD prior is conditioned on the root age (/„„„). Here we use an offset exponential 


distribution with minimal age 300 Ma (according to the oldest neopteran fossil) and mean age 
390 Ma (the oldest insect fossil). Users should specify a reasonable root age prior according to 
the available fossil information. Several available probability densities all with mean 390 and 
minimal 300 are shown in Fig. 3. The exponential prior is more diffused than the rest three. 

An alternative tree prior that can be used in the total-evidence dating approach is the 
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uniform prior (Clock: uniform) (Ronquist 


0.012 


— offsetlognormal(300, 4, 1) 

— offsetgamma(300, 2, 0.022) 
— offsetexp(300, 0.01111) 

— truncatednormal(300, 390, 60) 


et al., 2012a). It assumes that the internal nodes 


are draw from uniform distributions and the 


0.01 


fossils are only tips of the tree (so-called tip 


> 8 
dating). Similarly to the FBD prior, the uniform 2 5 
prior is also conditioned on the root age and > 8 
reguires setting treeagepr. 8 z 
In order to root the tree properly, we enforce the 3 
constraint HymenFossil defined above. This 8 
forces the Hymenoptera with fossils to form a s 
paonophyletic group. ° 300 350 400 450 500 550 600 
mr set topologypr = age 
codStBajint HymenFossil) Fig. 3 Probability density functions of offset 
For the Markov chain Monte Carlo lognormal, offset gamma, offset exponential, 
(MCMC) (Metropolis et al., 1953, Hastings, UG I distributions, 


8 all with mean 390 and minimum 300 
1970), we use two independent runs and four 


chains (one cold and three hot) perrun for 500,000 iterations and sample every 100 iterations. 
mcmcp nrun = 2 nchain #44 ngen = 500000 samplefr = 100 
mcmcp filename = hym.te printf», = 1000 diagnfr = 5000 
mcmc 
The output file names all start with hym. te (represented ash ym. te. *). The chain states are 
printed to screen every 1000 iterations, and the convergence diagnostics are printed every 5000 
iterations. The mcmc command executes the MCMC run. 

While the MCMC is running, the iteration number, likelihood Walues, and average 
standard deviation of split frequencies (ASDSF) are printed to the screen. The ASDSF should 
be decreasing toward 0, indicating the tree topologies sampled from the two different runsrafe 
getting similar and converging to the same (stationary) distribution. Typically, ASDSF < 0,04 
is a good sign of consistency between runs. 

To summarize the parameters and trees after the MCMC, use 

sump 

sumt 
By default, the first 25% samples are discarded as burnin. This can be adjusted according to 
the likelihood traces from the two runs (e.g. viewed in Tracer). The effective sample size (ESS) 
is an important indicator to judge if sufficient MCMC samples are collected. Ideally, the ESS 
should be larger than 100 for all parameters. We may need to increase the chain length and 
adjust MCMC proposals to improve the estimates. 

The consensus tree including all fossils is highly unresolved due to the uncertainty in the 
placement of the fossils. In order to display the node ages clearly, we can redraw an extant taxa 
tree by pruning the fossils. The output filename is changed to avoid overwriting the existing ones. 
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delete fossils 

sumt output = hym.rf 
Note here this sumt command does not rerun the MCMC but just for displaying the extant 
taxon tree. 


2.6 Node dating 


In node dating, we calibrate the internal nodes of the tree instead of the fossils above. The 
morphological characters of the fossils are not used, thus we remove them. 
delete fossils 
exclude 24 130 168 
calibrate 
Tenthredinidae = offsetgamma (100,150,25) 
GepSirOruApo = truncatednormal (140,175,25) 
prset nodeagepr = calibrated 
The birth-death prior under diversified sampling (Höhna et al., 2011) is used for the time 
tree. Compared to the EBD prior used above, there is no fossil sampling parameter in this case. 


The priors for the root age, speciation and extinction rates are not changed. 
prset brlenspr = clock:birthdeath 


prset samplestrat - divefsity 


prset sampleprob = 0.0001 
prset speciationpr = exp(10) 
prset extinctionpr = beta(1,1) 
prset treeagepr - offsetexp (300, 390) 
The uniform tree prior (clock: uni form) is also applicable. 


It is important to force the calibrated clade to be monophyletic and enable the.constraints. 

The constraint Hymenoptera helps to root the tree properly. 
prset topologypr = constraint (Hymenoptera, Tenthredimidae, 
CepSirOruApo) 

The output filename is changed to avoid overwriting the existing ones. If the node dating 
analysis is continued after the total-evidence dating in the same MrBayes session, the starting 
values also need to be reset. The other settings are kept the same as in the total-evidence dating 
analysis. 


mcmcp filename = hym.nd startp = reset startt = random 


2.7 Non-clock analysis 


Lastly, we run an analysis without the molecular clock assumption and calibrations. The 
branch lengths are measured by genetic distance (expected substitutions per site). This is a 
typical analysis most people do using MrBayes. We do not use fossils, and do not constrain the 
topology so that they are uniformly distributed. 

delete fossils 


prset brlenspr = uncons:gammadir(1,1,1,1) 
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prset topologypr = uniform 

mcmc filename = hym.un 

sump 

sumt 

The prior for branch lengths is gamma-Dirichlet(1, 1, 1, 1) (Rannala et al., 2012: 

Zhang et al., 2012), which assigns gamma(1, 1) for the tree length and uniform Dirichlet for 
the proportion of branch lengths. The compound Dirichlet prior was shown to help avoid 
overestimating the tree length (Zhang et al., 2012) and is now the default prior in MrBayes 
(since 3.2.3). 


3 Results and discussion 


The posterior estimates of the parameters are summarized in separate files, which can 
be opened using a text editor. The information is also printed to the screen. The partition 
rate multipliers are in hym.“ .pstat (here * is to match te and nd, the same below). The 
morphologies (mt fk) and 16S (m(2)) partitions evolve at similar rate. The 1“ and 2" codon 
positions of Efla (m{39) evolve much more slowly than the 3" codon position (mf 4 )). Thus 
it is reasonable to take into account such significant rate variation across partitions. 

The majority-rule consensus4rees aressummarized in hym. *.con. tre, which can 
be visualized by FigTree (http://tree.bio-edtac/fik/software/figtree) or IcyTree (https://icytree. 
org). The node ages are also in hym. * .vstaf, associated with the bipartition IDs in 
hym.*.parts. The root ID is 0 which includes all taxa, While the ID of Hymenoptera is that 
which excludes the outgroup Raphidioptera ( . 444444444). The extant taxa trees from total- 
evidence dating and node dating under diversified sampling and IGR eloel model are shown 
in Fig. 4. The topologies are the same in general, except for the Cephus and Sirex clade. The 
age of Hymenoptera (250 Ma) inferred from total-evidence dating is similar to that from node 
dating, but with a relatively narrower HPD interval. 

The total-evidence dating approach models the fossilization and sampling process 
explicitly, and incorporates different sources of information from the fossil record while 
accounting for the uncertainty of fossil placement. In comparison, the node dating approach 
discards the fossil morphologies, and relies on secondary interpretation of the fossil record as 
node calibrations. Total-evidence dating appears more objective and rigorous, and provides an 
ideal platform for exploring and further improving the models used for Bayesian molecular 
clock dating analysis. 

Comparing the non-clock tree (Fig. 5) with the clock trees (Fig. 4), it is obvious that the 
evolutionary rate is not constant over time. The Xye/a and Onycholyda branches evolve much 
more slowly than the Raphidioptera and Orussus/ Vespidae clade, and there are indeed dramatic 
rate changes between adjacent branches. In this case, the IGR relaxed clock model is more suitable 
than the autocorrelated TK02 model, and it is not reasonable to assume a strict clock model. 
Apart from these perspectives, a more statistically rigorous model selection procedure (e.g., using 


reversible-jump MCMC or marginal likelihood) needs to be carried out in future studies. 
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A Raphidioptera 
Xyela 
Onycholyda 
Runaria 

Diprion 
Phylacteophaga 
Cephus 

Sirex 

Orussus 


Vespidae 


B Raphidioptera 
Xyela 
Onycholyda 
Runaria 
Diprion 
Phylacteophaga 
Cephus 
Sirex 
Orussus 


Vespidae 


400 350 300 250 200 150 100 50 0 Ma 


Fig.4 Majority-rule consensus trees of extant taxa from total-evidence dating (A) and node dating (B), 
under the diversified sampling and IGR models 
The node heights are in units of million years and the error bars indicate the 95% HPD,intervals 
The numbers at the internal nodes are the posterior probabilities of the corresponding clades 


Raphidioptera 


Xyela 


Onycholyda 


Runaria 


Diprion 


Phylacteophaga 


Cephus 


Sirex 


Orussus 


Vespidae 


0.1 


Fig.5 Majority-rule consensus tree of extant taxa from a non-clock analysis under the gamma-Dirichlet prior 
The branch lengths are measured by expected substitutions per site 
The numbers at the internal nodes are the posterior probabilities of the corresponding clades 
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In conclusion, this study provides a brief overview and comparison of total-evidence 
dating and node dating analyses, and demonstrates the functionality of MrBayes using a 
truncated dataset of Hymenoptera. For the analyses and discussion using the full data, please 
see Ronguist et al. (2012a) and Zhang et al. (2016). 
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